14.4 Text Classification Using Naïve Bayes
Download the file from the link below to follow along with the text example or video and to practice on your own.
As we learned in the section that discussed conditional probabilities, text document classification using the Naïve Bayes Classifier relies on the likelihood of certain words appearing in a specific class of document. But a logical question arises: where do these words come from? What is the correct set of words to use for the Naïve Bayes classification process?
The most common way to classify text documents (emails, tweets, posts, etc.) is to use the Naïve Bayes Classifier with the bag-of-words model. In the bag-of-words model, each document to be classified is evaluated as a collection of unordered words. This means that context information (sentence structure, punctuation, word order, and word relationships such as word pairs) is not considered by this model; for example, "the dog bit the man" and "the man bit the dog" produce the identical bag of words. The only things evaluated are the appearance, and frequency of appearance, of specific words.
Reviewing the example of filtering spam emails from the previous section, we first note that a set of documents is used as training data to calculate the likelihoods for specific words. We also note that this training data produces three bags of words with their associated likelihoods: (1) an overall bag of words from the entire training set, (2) a bag of words for Class A (in our example, Normal email), and (3) a bag of words for Class B (in our example, Spam email). Thus, the training data must consist of a set of documents (such as emails) and their associated classes.
Once the training data provides the likelihoods for each word for each class, we can test any new document and determine its most likely class. As noted previously, for the document being tested, if:

p(Class1 | word1, word2, word3, …) > p(Class2 | word1, word2, word3, …)

then we assign the document to Class1. Thus, we test the document using the Naïve Bayes theorem for each class. If the calculated conditional probability (given word1, word2, word3… is true) of it being Class1 is greater than the calculated conditional probability of it being Class2, then we assume it is in Class1.
The next step is to build the bag of words and calculate the overall likelihoods and the conditional likelihoods (probabilities). One way to build the necessary bags of words is by writing a software program using Python or Java. In this example, however, we will build the probability test using Excel tools. Note: Remember that we use the term likelihood to refer to probabilities that are precalculated.
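For readers who would rather see the idea in code, here is a minimal Python sketch of building the three bags of words. The sentences and labels are invented for illustration; the chapter itself builds these tables in Excel.

    from collections import Counter

    # Invented training data: (document text, class label) pairs.
    training = [
        ("the method counts words in each document", "Class1"),
        ("likelihoods are estimated from word frequency", "Class2"),
    ]

    overall = Counter()      # bag of words for the entire training set
    per_class = {}           # one bag of words per class
    for text, label in training:
        words = text.split()
        overall.update(words)
        per_class.setdefault(label, Counter()).update(words)

    total = sum(overall.values())
    p_word = {w: c / total for w, c in overall.items()}  # p(word)
    print(p_word["word"])    # likelihood of "word" across all documents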
Determining the Bags of Words and Conditional Probabilities
One of the issues with the Naïve Bayes Classifier on textual documents is the size of the data files available. In some cases, the entire dataset can be used. Often, however, to simplify the process, a random subset of the available data is used and still yields a viable solution. In the email example of an earlier section, we indicated that there were a total of fifty word occurrences spread across only six distinct words. For actual datasets, there can be hundreds of words drawn from thousands of training documents.
The following sections explain the process when working with actual data. However, while we are learning the steps, we will still use a small and contrived example. For this example, we have selected a paragraph of four sentences from a previous section of the book to use as the training data. Out of these four sentences, we will classify the first three as Class1. The final sentence has some special words, such as "likelihoods", "estimated", and "frequency", that are unique to it. We will define this sentence as belonging to Class2. Again, we recognize that this is a contrived example, but it is short and serves to teach the process. These four sentences will be used as the training data. We will also invent a sentence to test for Class1 or Class2.
Remove Irrelevant Punctuation
In a bag of words, punctuation and case are usually not meaningful, so we lowercase everything and replace punctuation with spaces. We will use cell A1 as an example cell containing the text of a document. We use =LOWER(A1) to lowercase all of the text in cell A1. We can use =SUBSTITUTE(A1,". "," ") to replace each period followed by a space with a single space. Matching the period-space pair, rather than every period, means that periods inside URLs or IP addresses are retained as part of the address. (Note: If a document does not end with a period-space combination, the ending period may be missed.) We can do a similar substitution for colons followed by a space: =SUBSTITUTE(A1,": "," "). We can also replace other punctuation that never appears inside tokens, such as "?", "!", ";", and ",", with spaces. Depending on the content of the documents, you may also need to remove parentheses and quotation marks.
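The same clean-up step can be sketched in Python (the sample text is invented; this is a sketch of the idea, not part of the Excel workflow):

    def clean(text):
        text = text.lower()
        # Replace only period-space and colon-space pairs so that
        # periods inside URLs or IP addresses are retained.
        text = text.replace(". ", " ").replace(": ", " ")
        # Punctuation that never appears inside a token becomes a space.
        for ch in "?!;,":
            text = text.replace(ch, " ")
        return text

    print(clean("Spam filters work. Mostly!").split())
    # ['spam', 'filters', 'work', 'mostly']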
Figure 14.9 shows the four sentences that are used as the training data. In this figure, the sentences have had all capitalization and punctuation removed. The first three sentences are Class1 and the last sentence is Class2.
Tokenize the Words
The most important step in creating a bag of words is to separate the words in each document. This process is called tokenization. We do this so that we can count how many times each word is used across all documents. In Excel, we need each word of all documents in a single column, or one token (word) from each document per row. We will do this in three steps. First, we will determine the number of rows needed and duplicate the documents over and over in column A. Second, we calculate the beginning position of each word in each document in Column B. Third, we extract each word from each document into Column C.
Step 1: To do this, count the number of words in your largest document and multiply that number by the number of documents in the dataset to get the number of rows you will need to start with. For example, if your largest document had 50 words and you have a total of 100 documents, you would need 50 × 100 = 5000 rows. (Note: To capture the last word of the longest document, you will need one more set of documents, so you will actually need 5100 rows. The formula looks ahead to the next space and so needs one extra set of rows at the end.) In our example, there are four sentences. The longest sentence has 30 words, so 4 × 30 = 120 rows, plus one extra set of 4 rows, gives the 124 rows that will be needed.
Copy your cleaned-up set of documents to a new worksheet, repeating the documents over and over again for the number of rows needed. For example, from Figure 14.9, the data is in rows 19 through 22. We copy those rows, go to a new worksheet, and highlight A2 through A125. Then use Paste Special and choose Values (not the formulas). Since you are pasting 4 original rows into 124 rows, Excel automatically repeats the 4 rows again and again to fill the space. At this point, you have one document per row, with the documents repeated over and over again. This is shown in Figure 14.10. Also note that since row one is the header row, all row references are one larger than you might expect.
Step 2: Next, we calculate the starting position of each word in each document and place it in Column B. Note that these values are interleaved by document: B2 will hold the position for the document in A2, B3 the position for the document in A3, and so on. Since the first word in each document starts at position 1, we can treat the space before it as being at position 0, so we enter 0 in the first set of rows of the new worksheet (B2 through B5). Where the rows start to repeat, at B6, we calculate the next space with =FIND(" ",A6,B2+1). The FIND function returns the position in the string (the document) of what it finds. In this case, it searches the text of doc1 for the next space, beginning at the character after the previous word's starting location referenced in cell B2. In other words, it looks for the space at the end of the word.
This will give an error for documents with fewer than 30 words when we run out of spaces. To resolve this, we wrap the FIND function in an IFERROR function, which returns one plus the text length to mark the position after the last word, =IFERROR(FIND(" ",A6,B2+1), LEN(A6)+1), and copy it down column B through row 125. Column B then holds the location of the space before each word, or token. Again, the values are interleaved, with all the first words grouped together in adjacent rows, then all the second words, and so on.
Step 3: We can now extract the words into Column C using the MID function. For example, in C2, the text for the function is in A2, the starting position is one past the previous space (in B2), and the length is the difference between the next space in the same document (four rows down, in B6) and the current space (in B2), so the formula is =MID(A2,B2+1,B6-B2-1). This will also yield an error on rows that run out of words, so we wrap an IFERROR around it and return some placeholder text such as a period, =IFERROR(MID(A2,B2+1,B6-B2-1),"."), which can be copied down Column C through row 125.
We now have in Column C all the words from all the documents, one word per row. For documents that were shorter than the maximum length, the cells after the document ended contain periods.
Figure 14.10 illustrates the above steps. Cell B6 contains the beginning location of the word “technique” and cell C6 contains the word “technique”. Column D contains the LEN function to give the length of each word. Notice in the final rows (118 through 125) that periods indicate that the routine has finished finding words in a particular document.
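As an aside, everything the FIND and MID columns accomplish collapses to a single split per document in a general-purpose language. A Python sketch, with invented sentences:

    # One token per "row," mirroring Column C of the worksheet.
    docs = [
        "naive bayes is a simple classification technique",
        "likelihoods are estimated from word frequency",
    ]

    tokens = []
    for doc in docs:
        tokens.extend(doc.split())

    print(tokens)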
Delete Short Words and Count Word Frequency
In the English language, there are many short words, such as "the" or "and", that tend to be meaningless in a text document. We would often like to delete the words (tokens) that are three letters or fewer. To do this, in Column D we calculate the length of each word, =LEN(C2). Next, create a PivotTable of the data and place it on a new worksheet starting in column A. (Note: You can insert a new Row 1 and provide column headings before creating the PivotTable.) Include Columns C and D in the data. In the PivotTable Builder, drag the Word field to the row labels, drag it to the values area as well to create a count column, and drag the Length field to the filter, deselecting lengths of 0, 1, 2, and 3. (Click on the drop-down arrow to get the selection of values.) Only the longer words remain in the list. Figure 14.11 illustrates this PivotTable with the words, the count of words, and the filter set to lengths of four or greater.
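A Python sketch of the same filtering and counting, using an invented token list:

    from collections import Counter

    # Drop tokens of three letters or fewer, then count frequency:
    # the equivalent of the LEN column plus the PivotTable.
    tokens = ["technique", "the", "likelihoods", "and", "technique"]
    counts = Counter(t for t in tokens if len(t) > 3)
    print(counts)  # Counter({'technique': 2, 'likelihoods': 1})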
Calculate Conditional Probabilities
With a PivotTable of counts of the longer words in column B, we can now calculate the probability, or likelihood, of each word, p(word). Before calculating probabilities, however, we want to include a smoothing operation.
Using the entire training set, every word appears at least once. However, when we extract the words for each class, we have only a subset of the overall set of words. When we test with the Naïve Bayes classification, there will be words in the test document that do not appear in the training data for Class1, Class2, or both. Those words would have a likelihood of zero, and multiplying by zero always returns zero, so the test would not return a valid value. For those cases, we assign a count of 0.001 so that we have a very small, nonzero likelihood. To be consistent with those later calculations, we add the same small number, 0.001, to each count in the training tables as well.
Now, in column D we calculate the probability of each word by dividing the count value in column C by the grand total. In our case, we have 35 rows, so we use a formula such as =C4/$C$35. In this example, C35 contains the sum of all the word counts, which is 52.
You will notice that these numbers are all quite small, often in the range of 0.02 or less. When we multiply many of these numbers together, we can get extremely small values, such as 1.0 × 10^-15. This can create the underflow problem that we discussed earlier. So, in column E, we take the natural log of the probabilities in column D, such as =LN(D4), and copy it down the column. Column E then contains the logs of the likelihoods. Figure 14.12 illustrates these steps. The anti-log can be taken later and used in the Naïve Bayes algorithm to classify documents.
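Expressed as a Python sketch with invented counts, the smoothing and log steps look like this (the same steps apply to the Class1 and Class2 bags of words):

    import math
    from collections import Counter

    counts = Counter({"technique": 2, "likelihoods": 1, "frequency": 1})

    # Add 0.001 to every count, divide by the grand total, and take
    # the natural log to head off the underflow problem.
    smoothed = {w: c + 0.001 for w, c in counts.items()}
    total = sum(smoothed.values())
    log_lik = {w: math.log(c / total) for w, c in smoothed.items()}
    print(log_lik["technique"])  # ln p("technique")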
Remember from our previous example that we need to have three likelihoods, p(word), p(word|Class1), and p(word|Class2). So far in this explanation, we have been using the entire set of documents. We have calculated p(word), which includes both Class1 and Class2. We need to do the same process to calculate p(word|Class1) and p(word|Class2). We use only Class1 documents and only Class2 documents to get the set of words used in each class. We repeat all of the above steps to obtain conditional probabilities, or conditional likelihoods, for Class1 and Class2 words. In other words, we obtain p(word|Class1) and p(word|Class2) for each word in the combined set of words.
Because the previous figures illustrated the steps in calculating p(word), we will not show the spreadsheets for calculating p(word|Class1) or p(word|Class2). The process is exactly the same, except that the training set includes only the Class1 documents or only the Class2 documents, respectively. The resource file for this section does include those worksheets.
Predicting the Class Using the Naïve Bayes Classifier Model
To predict the class for a test document or sentence, we do the same pre-processing steps discussed above, then calculate the document's probability for each class using the Naïve Bayes theorem and choose the class with the higher probability. For our example, to test whether a document is Class1, we can express the Naïve Bayes theorem as follows:

p(Class1|TestDoc) = p(Class1) × p(word1|Class1) × p(word2|Class1) × … × p(wordn|Class1) / (p(word1) × p(word2) × … × p(wordn))
In this example, we will use the complete Naïve Bayes Classifier equation, including the denominator. The value of p(Class1) is just the overall likelihood of any document being Class1: in our example of four sentences, three are Class1, so a given sentence has a 3/4 (0.75) likelihood of being in Class1 and a 1/4 (0.25) likelihood of being in Class2. The denominator p(TestDoc) is the product of the overall word likelihoods: p(word1) is found by using VLOOKUP in the overall likelihoods word table, while p(word1|Class1) is found using VLOOKUP in the Class1 likelihoods table. The probability for the TestDoc is calculated for both Class1 and Class2, and the one with the higher probability is the predicted class.
The TestDoc that will be used for this example is the following:
In Excel it is easy to split each document into separate words. Highlight the documents, choose Text to Columns on the Data tab, select Delimited, check the Space and Tab boxes, and click Finish. This tokenizes the words of each document.
To calculate the probabilities, we will use the model likelihood tables as lookup tables, find each token's (word's) likelihood for each class, and sum the token log-likelihoods for each class for each document. The formula would be something like =VLOOKUP(D2,Class1Lookup,5,FALSE). In this example, we are using column 5 of the lookup tables, which contains the logarithm of the likelihood. Remember from your college algebra class that to multiply numbers together using logarithms, you add the logarithms (or subtract a logarithm for division) and then take the anti-log, that is, exponentiate.
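In other words, the quantity the worksheet accumulates for each class is the logarithm of the equation given earlier:

ln p(Class1|TestDoc) = ln p(Class1) + ln p(word1|Class1) + … + ln p(wordn|Class1) − (ln p(word1) + … + ln p(wordn))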
But what about the words not in the lookup table? For these rare events, as previously discussed, we use 0.001 as their count and then divide as usual. So we add an ISNA() test to the previous formula, like =IF(ISNA(VLOOKUP(D2,Class1Lookup,5,FALSE)), LN(0.001/Class1Lookup!TotalWordCount), VLOOKUP(D2,Class1Lookup,5,FALSE)).
However, we still need to exclude the short words, so we wrap the formula above in another IF: =IF(LEN(D2)<=3, 0, IF(ISNA(VLOOKUP(D2,Class1Lookup,5,FALSE)), LN(0.001/Class1Lookup!TotalWordCount), VLOOKUP(D2,Class1Lookup,5,FALSE))).
Now we simply add the class prior logarithm and the sum of the word-likelihood logarithms (the numerator) for Class1, then subtract the sum of the overall word-likelihood logarithms (the denominator), to get the total Class1 probability logarithm. We take the anti-logarithm to yield the probability of the TestDoc being in Class1. We do the same for Class2. Then we compare the two probabilities, and the larger probability gives the predicted class. Figure 14.13 illustrates all of these steps. The equations are quite long, and you should read through them carefully to see how they work. The resource file provided will also assist you in understanding the steps.
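The same end-to-end scoring can be sketched in Python. The dictionaries below stand in for the likelihood worksheets, and every name and number in them is hypothetical:

    import math

    def log_score(tokens, log_lik, total_count, log_prior=0.0):
        """Sum log-likelihoods, skipping short words and smoothing unseen ones."""
        score = log_prior
        for t in tokens:
            if len(t) <= 3:          # exclude short words
                continue
            # Unseen words get the smoothed count of 0.001.
            score += log_lik.get(t, math.log(0.001 / total_count))
        return score

    # Hypothetical tables built in the earlier steps.
    class1_log_lik = {"technique": math.log(2.001 / 52)}
    overall_log_lik = {"technique": math.log(2.001 / 52)}

    tokens = "this technique works".split()
    numerator = log_score(tokens, class1_log_lik, 52, math.log(0.75))
    denominator = log_score(tokens, overall_log_lik, 52)  # ln p(TestDoc)
    p_class1 = math.exp(numerator - denominator)          # anti-log
    print(p_class1)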
Drawbacks of the Bag-of-Words Model
Because the bag-of-words model reduces each document to its tokens, the structure of the sentences is also discarded. Phrases can mean more than the words alone. Take this book review: "The book was garbage, but I loved it." Without structure and context, this might be classified as a negative review. Even though it is a naïve, simplistic model, however, the Naïve Bayes Classifier with a bag of words is very powerful and works well most of the time.